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The problem 

The Mann- Whitney- Wilcoxon (MWW) test is a statistical two-sample test, that is, a test which compares 
two sets of numbers. It comes in two forms. The first tests the null hypothesis that the two samples 
belong to the same distribution. This is the so-called two-tailed test. The second form tests the null 
hypothesis that either both samples have the same distribution or that the first sample is larger in a 
statistically significant way. If the null hypothesis can be rejected in this so-called one-tailed test, it can 
be concluded that the second sample is in fact larger. A strength of the MWW test in general is that it is 
distribution-free, i. e. it implies no assumptions about the actual distribution of the two samples. It is 
frequently used in the social and life sciences. 

Here, only the one-tailed test will be considered. Besides, in the following discussion it will be assumed 
that all elements of the combined two samples are different. The formulae which will be derived will 
follow Mann and Whitney's formulation of the test; the formulation of Wilcoxon is slightly different. 

The MWW test is a permutation test. It based on the idea that if the two samples have the same distri- 
bution, any quantity which you can compute from them will be invariant under permutations of values 
between, rather than just within the two samples. The quantity which is central to the MWW test in 
Mann and Whitney's formulation is called Mann- Whitney U: 

£ E 

siESi S2ES2 

Clearly, U cannot exceed |Si|*|S 2 |and will reach that limit only when all elements of Si are larger than 
those of S 2 . The smaller U, the higher the significance with which Si can be pronounced smaller than S 2 . 
(The one-tailed test is sometimes formulated the other way round, with the null hypothesis being that 
both samples have the same distribution or Si is smaller than S 2 . Then a sufficiently large U indicates 
that Si is larger than S 2 -) 

The combinatorial problem which we will use to compare programming languages is to compute the 
significance of this test exactly. There is an approximation which is acceptable for all but the smallest 
sample sizes (see Wikipedia), but here we will perform the exact computation. To do so, one has to 
compute the probability that two samples drawn from the same distribution result in the same or a 
smaller U. The number of possibilities of dividing | Si | + | S 2 1 values into two samples of size | Si | and 
| S 2 1 is given by the binomial coefficient ^ 'j ■ It remains to count the different choices for Si (and 
the corresponding S 2 ) which result in a Mann-Whitney U less than or equal to the given U (which is 
obtained by applying the formula for U to the data under consideration). 

I (cc) This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Germany License 1 


f 1 if SI > S2 
\ 0 otherwise 


The algorithm 


As explained briefly in the previous section, we want to compute the number of divisions of a set of 
numbers into two samples of given sizes so that a given maximal Mann-Whitney U is not exceeded. 
In this section we will choose a suitable parametrisation and develop the algorithm we will use in the 
implementations below. 

Considering the nature of the MWW test, the actual values drawn from the hypothetical distribution 
do not matter; U depends only on how they compare. We will assume them all to be different, and can 
without loss of generality think of them as the naturals from 1 to |Si I+IS 2 I- To simplify our notation, we 
will make some abbreviations. We will define n:= |Si I+IS 2 I, m:=|Si|and denote the given Mann-Whitey U 
with a small u. 

Our task is now to consider all choices for Si from the set {l,..,n} so that |Si|=m and count those for which 
U(Si, S 2 ={l,..,n}\Si)<=u. The result will be called l(n,m,u). The straightforward way of computing 1 
would be to loop over all possible sets Si and compute U for each of them. This is obviously the simplest 
way of doing it and will be improved upon. But let us first develop a parametrisation which allows us 
to consider each Si exactly once. The parameters we choose are the numbers of values belonging to S 2 
between neighbouring elements of Si when both samples are sorted together and written down in order. 
We denote these "gaps" in Si by dj. If there is no element of S 2 between successive values of Si, then 
the corresponding d;=0. The following example demonstrates the concept. The numbers called si l ' :> are 
the elements of Si, the S 2 are in S 2 , and in this example S 2 (l) is the smallest element of the complete 
sample. 


T 1 ) / „(2) / _(3) / (4) 


< So < So < S, < Si < si < S 


(A / „(2) 


do = 4 


’1 

d] = 0 


„(5) ^ ,(6) 

2 ' 


< < 


To select a specific Si, we will work our way up from the smallest values, first choosing do (the number of 
values starting from the smallest which belong to S 2 ; do is 0 if the smallest value belongs to Si). Then we 
compute the effect the following value, which is an element of Si, will have on the total U. It is clear that 
this smallest element of Si will contribute exactly do to U, because it exceeds the do smallest elements of 
S 2 but no others. Wouldn't it be nice if we could recurse and continue analogously with di? In fact this 
is possible: 

In order to treat all dj equally, we have to compute at each stage not only the contribution of the current 
element of Si to U, but also the amount by which the contribution of all larger elements of Si is made 
larger by the preceding elements of S 2 . We do this by regarding the given u as a budget from which we 
subtract these contributions. How much do we have to subtract? As stated above, the contribution of 
the first element of Si is do- The contribution of each of the other m-1 elements becomes larger by do 
because of these first elements of S 2 : They are smaller than all higher-up elements of Si. So in total we 
have to subtract m*do from our u budget. If it becomes negative, we know that the originally given U 
has already been exceeded, so we can abort the recursion early. 

Our algorithm can be written as the following recursion formula: 


t(n, m, u ) 


'0 

1 

n—m. 

Y f(n — d — 1, m — 1, u — m ■ d) 
. d= 0 


if u < 0 

if u > 0 A m = 0 

otherwise 


This is what we will implement. 


'M 


This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Germany License 


2 


Test cases 


In choosing the test cases for the benchmark comparison, I was influenced by the particular problem for 
which I had considered using the MWW test. In that application, the sizes of both samples are equal, so 
I chose them to be equal in the test cases as well. As a consequence, n=2*m in all cases. All the programs 
take m and u as their arguments, while n is omitted. The values chosen for m were a compromise: large 
enough for the runtime of the faster programming languages to be measurable but small enough that 
running the slower implementations was still feasible. The following table shows the three test cases 
and the computation results: 


m 

u 

l(2m,m,u) 

( 2 r) 

Dead ends 

Recursions 

14 

100 

21863028 

40116600 

2577355 

37694806 

16 

100 

91520024 

601080390 

22710753 

153132717 

22 

70 

21266626 

2104098963720 

30545833 

56321707 


The fourth column shows the number of choices of m out of 2m items, the total possibilities of choosing 
the two samples. This is given only for illustration, as our algorithm does without searching through all 
of them. The "dead ends" are the number of times the selection of the d; has to backtrack without finding 
a solution, that is the number of times the first branch in the recursion formula above applies. The last 
column shows the total number of calls to the function l(n,m,u) if the algorithm is implemented recur- 
sively, or equivalently the total number of stack push and pop operations for an iterative implementation 
(see below). 

The columns which are relevant for the computational complexity of the task are the first, the third, 
and the last two. The first column, m, determines the maximum depth of the recursion. The third, the 
end result, gives the number of increments to the variable containing the result. The last two columns 
(especially the last) determine the amount of recursion (or stack management) necessary to obtain the 
result. So our second test case is a variant of the first which is more computationally expensive in 
all respects. The third test case, however, has approximately the same result as the first, but a larger 
maximum recursion depth, and requires a more complex search of the tree of choices for the dj to arrive 
at the result. 

Programming languages and test platform 

The purpose of this investigation was to compare programming languages and to some extent their vari- 
ants and implementations. The programming languages considered primarily were C, Perl, Haskell and 
Java. C (next to Fortran) is the traditional choice for scientific computing, and still often regarded as 
delivering the best performance. Perl is a script language one does not associate with serious number- 
crunching, but which allows quick coding of small utilities for day-to-day tasks. Haskell is a pure func- 
tional programming language which is close to mathematics in its notation. Java is a just-in-time com- 
piled language which has been reported as now providing performance on a par with C. Two other 
languages were tested more or less systematically, mostly because they support arbitrarily large integers 
natively. They were Ruby, a multi-paradigm language that has evolved from Perl and others, and be, a 
calculator available on all UNIX systems. 

Besides comparing programming languages, we will take a look at parallel processing. In the forseeable 
future, processors will not primarily become more powerful through increases in clock frequency, but 
by having an increasing number of processing cores. This part of the investigation will center on the 
language C and compare different ways of parallelising the task. 

Lastly, a comparison between different variants of the same languages was performed. Different compil- 
ers were tested for C and Java. Because the application of the MWW test I originally aimed at would have 
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required integers wider than 64 bits, and because some languages natively support arbitrarily large inte- 
gers, 64-bit implementations and large-integer implementations are compared separately In most cases, 
both a recursive function implementation and an iterative (loop) implementation was tested. For the 
latter, the tree search of the values for the di was implemented using a manually allocated stack. Two 
large-integer packages available for Perl were also compared. 

All tests were carried out on a dual-core Intel Core2 Duo 6600 processor running at 2.4 GHz, with 1 
GB of RAM. The operating system used was Ubuntu Linux 6.06 x86_64 with a 2.6.23.8 kernel. For the 
duration of the tests, the computer was otherwise idle. The versions of the compilers or interpreters and 
supplementary packages used are listed in the table below. 


c 

Gnu gcc 

4.0.3 

c 

Intel icc 

9.1 

c 

GMP library 

4.1.4 (libgmp.so. 3.3.3) 

Perl 

Interpreter 

5.8.7 

Perl 

bigint 

0.07 

Perl 

Math: :BigInt: :GMP 

1.24 

Java 

Gnu gcj/gij 

4.1.0 

Java 

Sun JDK/JRE 

1.5.0_06 

Haskell 

GHC 

6.8.2 

Ruby 

Interpreter 

1.8.7 

be 

Gnu 

1.06 


Results 

Development effort - Measuring runtimes - Machine-sized integers - Large integers - Memoization - Parallel 


Development effort 

Before we start looking at runtimes, here is a very approximate comparison of the time it took to code 
the algorithm in each language. Some people maintain that this is even more important than runtimes, 
as programmer time is much more expensive than computation time. Though coding time has to be 
considered, I do not think this is true in most cases. Often a computation is part of an iteration of trying 
to solve a larger problem, with successive changes depending on the previous round's result. Then a 
slow computation can waste a lot of human time, and putting some effort into reducing runtimes will 
pay off. In my experience, the case where a numerical program is coded to be run only once or twice is 
extremely rare. In those cases, it certainly makes sense to choose the programming language which is 
quickest to code. 


Language 

Development time 

C 

« 1 day 

Perl 

Hours 

Java 

« 1 day 

Haskell 

Hours 

Parallel C 

Days 


This table has to be taken with a pinch of salt. It reflects not just properties of the languages, but also 
which programming languages I know well (C and Perl), and which I do not (Haskell and Java). That 
said, Haskell looks an especially interesting option for quick coding, in particular for a mathematical 
problem such as the one we consider here. Despite having very little experience at it, I managed to 
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write the program rather quickly. In addition, it left me with the feeling that the other implementations 
benefited from my previously working with Haskell, as its spartan, mathematical notation helped to 
concentrate on the algorithm. This advantage may not apply to problems less purely mathematical than 
this one, or to programmers less theoretically interested than me. 


That Perl allows quick coding is probably no surprise. It is known as a language for quickly writing 
small utilities, and its weak typing, intelligible error messages and DWIM (Do What I Mean) principle 
all help to keep down development time. Java might be slightly faster to code than C for someone who 
knows it better than I do. Lastly, the parallel C implementations were most time-consuming. This may 
be due to that fact that I am no expert at parallel programming and that there were some unexpected 
effects, which are discussed below. 


Ruby and be were omitted from the development time table, as I wrote these implementations last. 
Due to having coded the algorithm several times before, the last implementations mostly amounted to 
copying. Both languages would probably rank with Perl if the program had been coded from scratch in 
one of them. 


Measuring runtimes 


Runtimes were determined using the t ime command. The time used for the comparison was the wall- 
clock ("real") time, not CPU time: If a program has to perform some IO for some unknown reason, that 
will slow it down and should be counted towards its total. The mean of at least three runs was taken 
unless the runtime was larger than a quarter of an hour. If the first run showed a significantly larger 
run time than the following runs, that was ascribed to the program, interpreter or auxiliary files being 
loaded into memory, and the run was discarded. For the compiled languages, the compile time was not 
included in the benchmark. 


Results for machine-sized (64-bit) integers 


The following table gives the runtimes for the 64-bit implementation. The first column gives the lan- 
guage and the compiler /interpreter, the second shows whether the implementation was recursive or 
iterative. The following columns show the programs' runtimes for the three selected sets of arguments. 
The rows starting "Java Gnu compiled" contain the results for Java code compiled to native machine 
code with gcj (not bytecode interpreted). While compiling Java code to native machine code precludes 
distributing byte code compatible across systems, this is primarily of interest for preserving trade secrets 
(which distributing the source code would not), not usually a concern in a research environment. 
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Language 

Implementation 

14/100 

16/100 

22/70 

C gcc -03 

recursive 

0.36 s 

1.1 s 

0.36 s 

C gcc -03 

iterative 

0.27 s 

0.94 s 

0.15 s 

C icc -03 

recursive 

0.43 s 

1.47 s 

0.38 s 

C icc -03 

iterative 

0.18 s 

0.66 s 

0.11 s 

Perl 

recursive 

46 s 

3 min 6 s 

1 min 4 s 

Perl 

iterative 

29 s 

1 min 44 s 

19 s 

Java Gnu 

recursive 

19.7 s 

1 min 20 s 

28.8 s 

Java Gnu 

iterative 

3.1 s 

10.5 s 

1.9 s 

Java Gnu compiled -03 

recursive 

0.41 s 

1.46 s 

0.53 s 

Java Gnu compiled -03 

iterative 

0.29 s 

0.92 s 

0.21 s 

Java Sun 

recursive 

0.35 s 

1.23 s 

0.46 s 

Java Sun 

iterative 

0.39 s 

1.1 s 

0.28 s 

Java Gnu gcj / Sun JRE 

iterative 

0.39 s 

1.1 s 

0.28 s 

Java Sun JDK / Gnu gij 

iterative 

3.1 s 

11 s 

1.9 s 


The first four rows show that gcc is better or worse than the Intel compiler, depending on the program: 
gcc is better with the function-recursive code, while icc is better with the iterative implementation. As 
expected, Perl is much slower than C (about by a factor of 100). If the program is not to be run too often 
and with too large parameters, its ease of coding and modification (see below) might still make it an 
option, though. 

Java lives up to its new reputation as a real alternative to C and other compiled languages, but not with 
the Gnu bytecode interpreter gij. It is the slowest option by a significant factor, especially so for the 
recursive implementation. The bottom two lines in the results table show that this is due to the speed of 
the Gnu interpreter, not its bytecode compiler. Sun's interpreter runs bytecode generated with the Gnu 
compiler gcj as quickly as that generated with Sun javac. 

Gnu gcj is special in that it can also compile Java to machine code. As one would expect, this is one of the 
best solutions. But the comparison between gcj's machine compiled code and Sun's bytecode interpreter 
mirrors the results of gcc and icc: For the recursive implementation, the Sun interpreter is faster, for the 
iterative code, it is gcj-generated machine code. 

Independently of the programming language it can be said that a loop implementation is faster than one 
using functional recursion. So if maximum performance is required, this is what should be implemented. 
That said, the difference is moderate in most cases (except when using the Gnu Java interpreter), so 
readability considerations may be allowed to overrule runtime concerns. 

In addition to the runs displayed in the table, the Java bytecode interpreter Kaffe, version 1.1.6, was tried 
out. It took 45 seconds for the loop implementation with arguments m=14, u=100, and due to this bad 
result no full benchmark comparison was performed. One further benchmark was performed which is 
not quite comparable with the others. The Octave language, intended for matrix computations in science 
and engineering, supports only floating-point computations. (Though conversions to and from integers 
are possible for storage, no integer arithmetic is possible.) The range of non-negative integer values that 
can be stored exactly in a double-precision floating-point variable is from 0 to 2 52 -l, sufficient for our test 
cases. The runtimes for a recursive implementation for Octave 2.1.72 were 16 min 6 s, 1 h 5 min and 21 
min 29 s, respectively. Octave therefore need not be considered for this type of combinatorial problems. 


Results for arbitrarily large integers 

Before presenting the results for arbitrarily large integers, an important caveat has to be noted. Our test 
cases all fit into 64 bits and therefore do not necessarily require arbitrarily large integers. So the following 
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results are not affected by the large integer toolkits' efficiency of allocating additional storage. What is 
tested, however, is the overhead resulting from the additional control flow necessary to decide whether 
a computation will overflow. 


Language 

Large integers 

Implementation 

14/100 

16/100 

22/70 

C gcc -03 

xyssl 

recursive 

0.74 s 

2.8 s 

0.7 s 

C gcc -03 

xyssl 

iterative 

0.64 s 

2.5 s 

0.53 s 

C icc -03 

xyssl 

recursive 

1.0 s 

3.4 s 

0.84 s 

C icc -03 

xyssl 

iterative 

0.67 s 

2.6 s 

0.58 s 

C gcc -03 

gmp 

recursive 

0.53 s 

1.9 s 

0.46 s 

C gcc -03 

gmp 

iterative 

0.39 s 

1.49 s 

0.29 s 

C icc -03 

gmp 

recursive 

0.59 s 

2.1 s 

0.53 s 

C icc -03 

gmp 

iterative 

0.37 s 

1.41 s 

0.28 s 

Perl 

bigint / Math: :BigInt: Calc 

recursive 

2 h 42 min 

10 h 54 min 

3 h 45 min 

Perl 

bigint / Math: :BigInt: Calc 

iterative 

1 h 58 min 

6 h 59 min 

1 h 17 min 

Perl 

Math: :BigInt: CMP 

recursive 

15 min 13 s 

1 h2min 

22 min 14 s 

Perl 

Math: :BigInt: CMP 

recursive, passing ref 

4 min 19 s 

17 min 46 s 

4 min 21 s 

Perl 

Math: :BigInt: CMP 

iterative 

3 min 56 s 

16 min 15 s 

3 min 38 s 

Java Gnu 

java.math.Biglnteger 

recursive 

28 s 

1 min 56 s 

42 s 

Java Gnu 

java.math.Biglnteger 

iterative 

13.7 s 

54 s 

11.7s 

Java Gnu compiled -03 

java.math.Biglnteger 

recursive 

1.3 s 

4.8 s 

1.6 s 

Java Gnu compiled -03 

java.math.Biglnteger 

iterative 

6.2 s 

25.6 s 

5.9 s 

Java Sun 

java.math.Biglnteger 

recursive 

1.5 s 

5.6 s 

1.5 s 

Java Sun 

java.math.Biglnteger 

iterative 

1.4 s 

4.8 s 

1.2 s 

Haskell 

(native) 

recursive 

1.5 s 

5.8 s 

1.9 s 

Ruby 

(native) 

recursive 

1 min 45 s 

6 min 42 s 

2 min 10 s 

Ruby 

(native) 

iterative 

1 min 19 s 

4 min 50 s 

57 s 

Ruby (no pthreads) 

(native) 

recursive 

1 min 10 s 

4 min 21 s 

1 min 20 s 

Ruby (no pthreads) 

(native) 

iterative 

1 min 7 s 

3 min 59 s 

44 s 

be 

(native) 

recursive 

2 min 13 s 

8 min 40 s 

3 min 1 s 

be 

(native) 

iterative 

2 min 8 s 

7 min 28 s 

1 min 22 s 


The results for C show that the GMP library lives up to its claims of performance, at least relative to the 
comparably unoptimised code from xyssl. All runtimes for programs using GMP are smaller than the 
comparable times for xyssl, and no results with GMP are worse than any with xyssl for a given test case 
(the only tie being between gcc iterative xyssl and icc recursive GMP in the last test case). The relative 
performance of gcc and icc has changed for the xyssl implementation: It seems that gcc does a better job 
translating unoptimised code, especially when recursion is used. When GMP is used, the comparison 
turns out as for machine-sized numbers: gcc is slightly better for recursive code, icc for the iterative 
implementation. 

Perl offers the "bigint" module which allows extremely easy conversion of existing code to large integer 
arithmetic: The statement "use bigint;" at the start of the script will convert all numerical variables 
to arbitrarily large integers. However, bigint uses the module Math::BigInt::Calc by default, which is 
implemented purely in Perl and has accordingly bad performance. (The module to be used can be 
selected with a use clause argument, but this did not seem to make any difference.) "use bigint" also 
has the disadvantage that it turns all numerical variables indiscriminately into large integers, including 
the parameters n, m and u, which were left machine-sized in the other implementations. The module 
Math::BigInt::GMP, which is a Perl binding for the fast GMP library, is considerably faster, but even it 
does not come remotely close to C. Apparently the bindings add considerable overhead, compared both 
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to the C+GMP combination and to non-large integer Perl. The two different recursive implementations 
require an explanation: In the second, denoted by "recursive, ref", a reference to a Math::BigInt::GMP 
object was passed which was incremented by the function, while in the first (considerably slower) a 
large integer was created and returned in every function call. The comparison shows that much of 
the difference between the recursive and iterative implementation is accounted for by the unnecessary 
construction of Math::BigInt::GMP objects. 

The gap between Java and C has widened compared to machine-sized integers: Here the difference be- 
tween the best Java and the best C results is at least a factor of 3.4, while for 64-bit integers it was at 
most a factor of 1.9. This is likely to be at least partly due to the unavailability of an increment method 
for java.math.Biglnteger. Instead, a full addition of the large-interger constant one has to be performed 
each time a solution is found, which is bound to be more expensive than a dedicated increment method. 
Among the Java results, the Gnu bytecode interpreter performs worst, as for machine-sized integers. 
Among the other Java runtimes, the Gnu-compiled iterative version is by far the worst, being approxi- 
mately a factor four or more slower than the second worst. This contrasts with the machine-sized integer 
results, where it was the best, and can probably be explained only by some quirks of the Gnu implemen- 
tation of java.math.Biglnteger. Sun's Java bytecode interpreter with the iterative implementation is best 
now, with the Gnu-compiled recursive version comparable except for the third benchmark, whose large 
maximum recursion depth penalises recursion. 

In addition to the three programming languages which were already compared for the machine-sized 
implementations, three additional ones enter the competition here, which support arbitrarily large inte- 
gers natively: Haskell, Ruby and be. Haskell is a pure functional programming language. Because an 
iterative implementation would be convoluted, if possible at all in such a language, it was not attempted. 
In performance, Haskell ranks with Java. It is slightly slower than the recursive Java implementation 
compiled and run with the Sun tools. It should be noted that Haskell was slower by a factor of three 
when type declarations were omitted, so don't do that. Ruby outperforms Perl with GMP bindings by 
around a factor of two, but is still far from the compiled languages, be shows similar results, though a 
bit slower than Ruby. 

An oddity was encountered with Ruby. The proportion of system time of the total wallclock runtime 
was significantly higher than for other languages. A system call trace revealed that Ruby calls a system 
function (rt sigprocmask) excessively often. This has been noticed by other users (see here) but not 
resolved yet. The effect is known to occur only if the Ruby interpreter is compiled with POSIX threads 
(pthreads) support. To test that. Ruby was recompiled both with and without pthreads support, hence 
the two sets of results in the table. 


Joker in the pack - Perl and memoization 

This would be all regarding single-thread runtimes if it were not for a very special feature of Perl — 
memoization. Memoization refers to caching the results of a pure function as they are computed to 
prevent duplicate computation of these results. Perl offers the Memoize module which uses Perl's possi- 
bility to replace symbol table entries to substitute a memoized version of a function in all its calls. As for 
the bigint module, this requires no changes throughout the code, but only the "use Memoize;" statement 
and the command to memoize a specific function. In contrast to most other languages, which would 
require the manual implementation of the cache, generation of cache keys and consistent substitution of 
function calls, Perl makes it easy just to try memoization out. 

Activating memoization for the recursive Perl implementation for machine-sized integers yields the fol- 
lowing result (compared to the result without memoization). 
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Language 

Implementation 

14/100 

16/100 

22/70 

Perl 

recursive 

46 s 

3 min 6 s 

1 min 4 s 

Perl, memoized 

recursive 

0.24 s 

0.35 s 

0.47 s 

C gcc -03 

recursive 

0.36 s 

1.1 s 

0.36 s 


Memoization at a stroke brings Perl's performance up to a level better than the recursive C implementa- 
tion. 

Our result for memoization provides instructive backfeed about our algorithm. As memoization pre- 
vents duplicate computations, it is clear that the algorithm is far from optimal and in fact recomputes 
the same (sub)expressions over and over again. Let's have a brief look at why this happens. Obviously 
the function l(n,m,u) is being called with the same arguments in several different branches of the tree 
of recursions. The recursion tree is parametrised by the di (which were the number of values belonging 
to the second sample S 2 between successive values belonging to Si). So two different sets of dj have to 
lead to the same sets of arguments for a recursive call of l(n,m,u). From the equality of the m argument, 
it follows that the depth of recursion for both calls has to be the same. Taking this into account, the 
equality of the n argument entails that the sum of the dj up to that level of recursion has to be the same 
too. Finally, equality of u implies that the following sum has to be the same in both cases (where r is the 
common recursion depth): 


^ {m - i) • dj 
i = 0 


The simplest example fulfilling all three requirements is: do=0, di=2, d2=0 and do=l, di=0, d 2 =l. Both 
lead to the evaluation of l(n-5,m-3,u-2m+2). 

In order to make the algorithm optimal, one would have to work out with which different sets of ar- 
guments l(n,m,u) gets called, and then compute and store these results in a bottom-up manner until 
the result for the original arguments is obtained. While possible, this would be time-consuming and 
error-prone. The beauty of memoization is that it does this for you. 

So can memoization be combined with large integers? The greatest drawback of "use Memoize" is that 
it does not automatically work for object arguments and therefore cannot be combined with "use bigint" 
without modification. The "use bigint" statement makes all numerical variables large integers, which 
are in truth objects. As a consequence, they cannot be converted to a string as easily as normal numbers. 
Because this is what Memoize does to compare a memoized function's argument list, it has to be told 
to construct the string differently. This is done by passing a user-defined function as a parameter to the 
use clause (see the manual page of Memoize or the sources of my implementation). So the answer is, 
memoization and large integers can be combined, but not as comfortably as each of them can be enabled 
individually. 

Memoization can of course be implemented in programming languages other than Perl, though less 
comfortably. Besides Perl, only a C implementation was done for the purpose of this investigation. The 
table below shows the comparison between the large-integer memoized and non-memoized recursive 
results for Perl and C. The runtimes for the C implementation were too small to be measured with the 
time command. 
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Language 

Large integers 

Implementation 

14/100 

16/100 

22/70 

Perl 

bigint / Math::BigInt::Calc 

recursive 

2 h 42 min 

10 h 54 min 

3 h 45 min 

Perl, memoized 

bigint / Math::BigInt::Calc 

recursive 

9.1 s 

13.7 s 

18 s 

Perl 

Math: :BigInt: GMP 

recursive 

15 min 13 s 

1 h 2 min 

22 min 14 s 

Perl 

Math: :BigInt: GMP 

recursive, passing ref 

4 min 19 s 

17 min 46 s 

4 min 21 s 

Perl, memoized 

Math: :BigInt: GMP 

recursive 

0.88 s 

1.3 s 

1.8 s 

C gcc -03 

gmp 

recursive 

0.53 s 

1.9 s 

0.46 s 

C, memoized gcc -03 

gmp 

recursive 

<1 ms 

<1 ms 

<1 ms 


It is interesting to note that not only has the computation become much faster, but the runtimes have 
also changed relative to each other. In both the machine-sized and the large-integer implementation, 
the memoized version takes longest for the last test case, in contrast to the second test case for the non- 
memoized version. The reason for this is that the different test cases profit to different extents from 
memoization: the second case more than the first, and the third less. 

Some concluding remarks about memoization and caching are in order. Though all the memoized imple- 
mentations above were recursive, of course iterative implementations can likewise cache partial results. 
Only recursive memoized implementations were considered here because Perl's Memoize module works 
only for them, and our iterative implementation would have to be modified to allow caching (partial re- 
sults would have to be generated). Memoization in Haskell is also possible, but perhaps not easy in our 
case, as it requires describing the result as an element of a cyclically defined infinite list. 


Results for parallel implementations 

The increasing availability of multi-core processors and multi-processor computers means that parallel 
implementations of any algorithm are of special interest. For the purpose of this section, we will revert 
to the non-memoized version of our algorithm. Though memoization and parallelisation are not incom- 
patible, combining them presents certain difficulties: Sharing data can lead to inefficiencies if it is to be 
writable by each parallel task, which would be necessary if the memoization cache were to be shared; 
if it would not be shared, the advantage of memoization would be at least partially compensated. In 
addition, in this section, we will exclusively consider the iterative implementation in the programming 
language C with the large-integer toolkit from xyssl. This will allow us to discuss cache issues which 
might not become apparent in a machine-sized implementation and would be harder to analyse with 
the optimised GMP library. 

There are two basic ways to parallelise a program: using threads or processes. Threads share a virtual 
memory space and therefore can more easily exchange information, while processes affect each other 
less but can communicate only via sockets. Both thread and process creation involves some overhead, 
which is smaller for threads. Even for threads, though, it would not do to spawn a new thread for each 
recursive call of the function l(n,m,u). The number of parallel tasks should be small enough that their 
creation overhead does not become too large, but large enough that an approximately equal distribution 
of the computational workload on all processor cores can be ensured. 

The latter requirement is hard to meet for of our problem: The number of recursive calls, which is printed 
in the very first table above, is hard to know a priori. In fact determining it seems a quite similar problem 
to computing our end result right away, so if one could obtain a closed-form solution for it, one would 
not have to do any numberics in the first place. What is clear, however, is that of the terms of the sum 
in our recursion formula, those with small d will entail deeper secondary recursion than those with 
larger d, because both n and u are reduced by an amount which increases with d. Some experimentation 
showed that this dependence is very strong. Therefore the computation was split up into parallel tasks 
in the following way: The first-level sum over d, with do starting at 1 rather than 0 was assigned to the 
first thread or process. Then do was set to 0 and the sum over di starting from 1 was assigned to the 
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second task. And so on, until the complete sum over d t -i, for do=...=dt-2=0, was assigned to the last task 
(where t is the total number of parallel tasks). 

For the purpose of our benchmark, the computation was split up into ten parallel tasks. This was suf- 
ficient for keeping both cores of the test machine equally busy. This was verified visually by viewing a 
CPU monitor. Also, the speedup achieved could not be improved significantly by changing the number 
of parallel tasks, for any implementation. So the shortfall of speedup relative to the theoretical maxi- 
mum value of two was not due to a disadvantageous choice of the number of parallel tasks. Here are 
the results: 


Implementation 

14/100 

16/100 

22/70 

Single task 

0.64 s 

1 (Def.) 

2.5 s 

1 (Def.) 

0.53 s 

1 (Def.) 

Threads, basic 

0.5 s 

1.28 

1.8 s 

1.39 

0.4 s 

1.33 

Threads, cache opt. 

0.36 s 

1.78 

1.35 s 

1.85 

0.33 s 

1.6 

Processes, forked 

0.36 s 

1.78 

1.35 s 

1.85 

0.35 s 

1.54 

Processes, shell 

0.36 s 

1.78 

1.35 s 

1.85 

0.35 s 

1.54 


The second column for each test case gives the speedup relative to the single-task implementation, the 
results of which are repeated in the first row. (Remember that all implementations here are iterative 
implementations in C with the xyssl library, compiled with gcc -03.) The second and third rows refer to 
two different multi-threaded implementations, which are explained below. The fourth line belongs to an 
implementation which forks new processes for each sub-task instead of spawning threads. The last line 
gives the speed for running each sub-task as a separate process from the shell, in the background and 
therefore in parallel. 

The last implementation deserves some further explanation. Its purpose was to investigate whether the 
speedup of a parallel implementation can be estimated with minimal changes in the program using the 
features a command shell provides. The Program was rewritten so that in addition to the size (which 
implies n and m) and u arguments, the index of the subtask could be passed on the command line. The 
latter had otherwise been determined internally from the thread or process index. Ten processes with 
different subtask indices were then started from the shell bash(l). Each printed out the sum resulting 
from its subtask, which were then added up by hand. The first process in addition printed the binomial 
coefficient 2m over m, in order to avoid a bias against the other implementations which also do that. 

Since the time command cannot be used to measure the termination of multiple commands, a different 
means had to be used to measure the runtimes. This was done by displaying the modification time of the 
pseudoterminal device /dev/pts/<num> associated with the terminal window in which the processes 
were run, from a different terminal with the command Is -1 — full-time. This modification time is 
updated whenever something is input or output on that terminal, such as both when the processes were 
started and when they terminated. In order to get the start time, the modification time of the terminal 
device had to be displayed after the processes were started but before the first terminated. In order to 
make this less fiddly, a sleep(l) command was run prior to the subtask processes, the delay of which 
was later subtracted. So, to sum up, first the following command was run in an X terminal: 

sleep 5; { . /c_mannwhitney_part <size> <u> 0 & \ 

. /c_mannwhitney_part <size> <u> 1 & \ 

. /c_mannwhitney_part <size> <u> 8 & \ 

. /c_mannwhitney_part <size> <u> -9; } 

(The negative subtask number -9 tells the program that this is the last subtask which should loop over 
d 9 starting from 0.) Within the next five seconds, the modification time of the associated pseudoterminal 
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device was displayed in another terminal. After all the processes had terminated, the modification time 
was displayed again, and the runtime computed as the difference between both times minus 5 seconds. 

As can be seen from the table, the optimised thread implementation (see below), the forked implementa- 
tion and the shell-based multi-process computation are on the whole equally fast, with a small difference 
only in the last test case. In particular, the shell-based parallel execution performed exactly like the im- 
plementation forking processes in C. This means a benchmark using a command shell and a program 
which can perform any subtask separately can be a guide to the efficiency of a parallel implementation. 
It may also be an option for a "production run" if obtaining the end result from the subtask results is not 
too much trouble. 

It still remains to explain the difference between the two threaded implementations and the reason 
for their differing performance. The "basic" implementation is a straightforward parallelisation of the 
single-tasked C code. The parameters, thread number and large integer variable for the subtask result 
are stored in a struct assigned to each thread. The xyssl library is relied upon to automatically allocate 
space for the large integer as needed. Space for the backtracking stack of the iterative implementation is 
allocated by the program as usual. To see what is wrong with that, look at the following table giving the 
runtimes of variants of the optimised implementation. In this implementation, sufficient space for the 
large integer was made available as part of each thread's struct, and all space allocated on the heap was 
padded by a certain amount of bytes. The following table shows the dependence of the runtime for the 
second test case (16/100) on the padding: 


Padding [bytes] 

0 

8 

16 

24 

32 

40 

48 

56 

64 

Runtime [s] 

2.44 

2.71 

1.85 

1.53 

1.41 

2.06 

1.43 

1.37 

1.39 

RMS deviation [s] 

0.3 

0.24 

0.21 

0.05 

0.03 

0.33 

0.06 

0.01 

0.04 


For this comparison, a sufficient but small eight bytes were preallocated for each partial result. The pro- 
gram was run five times for each padding. The table shows both the runtime and its standard deviation. 
For small amounts of padding, the program is inefficient; it runs more slowly than the single-task im- 
plementation for eight bytes of padding. Besides, variations between runs (expressed in the standard 
deviation) are large. The runtimes decrease and almost reach the optimum for a padding of 32 bytes, but 
then rise sharply again! For a padding of 48 bytes or larger, performance is again near-optimal and vari- 
ations are small. (The larger preallocation of 64 bytes causes performance to improve somewhat more, 
leading to the results in the previous table.) 

So where do the padding-dependent and erratically varying runtimes come from? This is a cache effect. 
Different processor cores share a second-level cache, but not their first-level cache. CPU caches are 
divided up into cache lines, which are the units in which caches are organised. When one core writes 
to a cache line which is also in a different core's level-1 cache, that core has to invalidate it and reload it 
from the second-level cache. Threads share the same virtual memory space and therefore compete for 
write access to cache lines unless their data is separated by a sufficient amount of space. Our program 
uses only small amounts of memory, but it does write to the large integer allocated to the result quite 
often: every time a solution with a small enough U is found, the sum is incremented. This implies that a 
thread's data can normally be kept in the first-level cache, unless there is a conflict caused by two threads 
trying to access the same cache line. As a consequence, such a conflict carries a heavy penalty. 

The cache line size of most of today's processors is 64 bytes. So a padding of 64 bytes is sufficient to 
ensure that thread-specific data ends up in a different cache line for each thread. (As it happens, the 
program allocates all threads' data with a single malloc. Also, the sub-struct containing auxiliary data 
for the large integer result is at the start of the data, while the space for the actual numerical data is at 
the end. Changing either of these two factors may also result in an improvement.) The good results 
for a padding of about half the cache line size (24 and 32 bytes) may result from a majority of thread 
data structs being located in different cache lines. "Odd" amounts of padding (those which have a 
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small greatest common divisor with the cache line size) cause alignments to vary and therefore make it 
probable that at least some structs will share a cache line. Cache effects are complex, and the interested 
reader is referred to this excellent article. 


Discussion and conclusions 

The previous sections of this report have compared various implementations of a combinatorial counting 
problem with a constraint. Some of the results are quite clear and partly confirm what has been found 
or suspected previously. Compiled languages (C, Java and Haskell) took significantly less computation 
time, of the order of seconds, compared to interpreted scripting languages (Perl, Ruby and be), which 
took of the order of minutes. Java offered performance comparable to C, and those two languages were 
the fastest overall (though the Gnu Java interpreter should be avoided). The Gnu large integer library 
GMP was significantly faster than a straightforward unoptimised large integer implementation provided 
by the xyssl library. 

The programming language Haskell was a great help in developing the algorithm due to its mathemat- 
ical notation. It also offers good performance, though not comparable to C with the GMP library. The 
greatest surprise was the huge benefit which Perl's memoization module offered. Especially for recur- 
sive problems where duplicate computations are hard to avoid in developing the algorithm, the ease 
with which Perl allows to try it out makes it a sensible choice at least for an exploratory implementation. 

The results for parallelised implementations in C suggested that the performance of a multithreaded 
implementation can be severly degraded unless subtle effects due to memory caching are taken into 
account. A multi-process (forked) implementation does not suffer from that drawback and is therefore 
probably preferable unless sharing large amounts of data is required. It was also shown that the perfor- 
mance of a parallel implementation can be reliably estimated by creating a separate program for every 
subtask and running them in parallel from a UNIX shell. 

Though the results of this investigation can probably be generalised at least to some extent, its limits 
have to be pointed out: 

• The problem chosen for this comparison required only integer arithmetic, so no inference can be 
drawn about floating-point performance. Also, not all integer computation problems may show 
exactly the same picture. 

• Test coverage of the large-integer libraries was small: The resulting numbers were small enough 
for machine-sized integers, so no great on-demand allocations had to be performed by the libraries. 
Also, the only operations performed on the large-integer variables were initialisation, incrementa- 
tion and addition, no more complex arithmetic. 


• Most of the compilers and interpreters were the versions which came with my distribution, and 
therefore not up to date. The exceptions were Haskell and Ruby. This should not influence the 
results too much, as most are mature languages. It might however make a difference in the case of 
Java, notably gij's performance. 

• Parallel implementations were tested on a processor with only two cores, so the benefit of having 
more than two cores is unknown. 

• I am more familiar with some programming languages than with others, which may have caused 
a slight bias. 
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Before drawing the conclusions, one more factor should be looked at. Besides the performance of a pro- 
gram written in a given programming language and the effort necessary to code in that language, it is 
also important how difficult it is to achieve optimal results for that language. Plain old C is clearly king in 
this respect: It is too low-level to hide details from the programmer which might influence performance, 
in contrast to object-oriented languages, for example. The worst implementation in this respect was the 
multi-threaded parallel one. Though also programmed in C, it was heavily affected by the details of 
memory cacheing. Only an expert on the processor used could write such a program in an optimal way 
without time-consuming trial and error. But also Perl was not free of some erratic features: In some of 
the Perl implementation, the map command proved faster than a for loop, in others it consumed large 
amounts of memory, apparently because the range of di to be looped over was sometimes created as an 
array and sometimes not. In Java, the use of a stack object was slightly slower than a more C-like im- 
plementation manually managing the stack (which was used for the benchmarks). Ruby's performance 
was reduced by a spurious system call being made excessively often; though this issue has been recently 
been recognised by the Ruby community and is likely to be resolved. In Haskell, it seems to improve 
performance to declare functions' type. Though this is not enforced by the compiler, it is good practice 
anyway. 

Finally, I want to make a recommendation to others seeking to solve a similar combinatorial problem. 
Haskell's mathematical notation and clarity make it a good choice for looking at the problem from sev- 
eral angles and developing an algorithm to solve it. Its performance is also quite tolerable, so nothing 
more needs to be done if it is quick enough. If this first implementation is not sufficiently fast, an imple- 
mentation in Perl should be done to find out whether memoization brings any benefit. If a memoized 
Perl implementation is still too slow, it should be rewritten in C. If a multi-processor machine is to be 
used to obtain still more performance, a division of the problem into subtasks can be evaluated most eas- 
ily by running them as separate programs from a shell. A final parallel implementation should best be 
multi-process rather than multi-threaded, perhaps using the standard Message Passing Interface (MPI), 
which was not used in this work but is widely used in the scientific computing community. Clearly, 
this recipe need not be followed step by step, but it leaves the most time-consuming implementations to 
last and puts those from which others benefit first. Adapt the procedure according to the problem to be 
solved and your preferred programming languages. 

Source code of all implementations 

The source code of the implementations in all languages which were benchmarked can be downloaded 
in the following TAR file: 

Download 

Update 2012 

In the last few years I have come into contact with some more programming languages. I have pro- 
grammed the Mann-Whitney counting problem in these too; see below for the results. The languages 
and compiler /interpreter versions are the following: 


Erlang 

R14B01, erts-5.8.2 

D 

DMD v2.059 -O -release 

Go 

1 . 0.1 

Python 

2.6.5 


Erlang is a distributed functional programming language that has been used to program high-reliability 
routers in telecommunications. It natively supports creation of "processes" inside the interpreter and 


rw 
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message passing. A variant called HIPE (for "High Performance Erlang") allows incorporating machine 
code on x86 processor architectures; otherwise the source is compiled to bytecode and interpreted. D 
is a compiled object-oriented language similar to C and C++, but with automatic variable initialisation, 
garbage collection and native data structures such as associative arrays. Go is a compiled language with 
a syntax between C and scripting languages like Python. It supports concurrency as well as object- 
oriented features. Last, Python is one of the big three scripting languages next to Perl and Ruby. I had 
omitted it previously because I did not know it very well, and have now added it for completeness. 

The table below shows the benchmark results. The computer on which the benchmarks were run is the 
same, so the results are comparable to those above. Some results for C are shown for comparison. 


Language 

Integer type 

Implementation 

14/100 

16/100 

22/70 

C gcc -03 

64 bits 

recursive 

0.36 s 

1.1 s 

0.36 s 

C gcc -03 

64 bits 

iterative 

0.27 s 

0.94 s 

0.15 s 

C gcc -03 

gmp 

recursive 

0.53 s 

1.9 s 

0.46 s 

Erlang 

native large integers 

recursive 

8.2 s 

29 s 

8.5 s 

Erlang, HIPE 

native large integers 

recursive 

5.5 s 

18.6 s 

4.2 s 

Erlang 

native large integers 

recursive, parallel 

4.6 s 

15.9 s 

4.3 s 

Erlang, HIPE 

native large integers 

recursive, parallel 

3.15 s 

10.3 s 

2.3 s 

D 

64 bits 

recursive 

0.42 s 

1.6 s 

0.46 s 

D 

64 bits 

iterative 

0.20 s 

0.73 s 

0.13 s 

D 

64 bits 

recursive, memoized 

« 0.016 s 

» 0.024 s 

« 0.016 s 

D 

std.bigint, passing ref 

recursive 

1 min 31 s 

5 min 46 s 

1 min 47 s 

Go 

64 bits 

recursive 

0.50 s 

1.8 s 

0.50 s 

Go 

64 bits 

recursive, memoized 

» 0.01 s 

» 0.01 s 

» 0.004 s 

Go 

math/big, passing ref 

recursive 

35.8 s 

1 min 58 s 

26 s 

Python 

64 bits (int) 

recursive 

27 s 

1 min 45 s 

31 s 

Python 

large integers (long) 

recursive 

53 s 

3 min 18 s 

63 s 


The speed of Erlang lies between that of compiled and interpreted languages. Even though it is based 
on bytecode, it cannot compete with Java for efficiency. Erlang, even with HIPE, is not intended as a 
high-performance language, and it shows. But it is still an order of magnitude faster than a scripting 
language. It syntax is clear and minimal, similar to Haskell, and its language features allow parallelisa- 
tion with minimum effort. That makes it suitable for exploratory implementations of problems that lend 
themselves to parallelisation. 

D and Go both offer speeds within a factor of 2 of C, as one would expect from compiled languages. 
Go's associative array implementation seems to be more efficient than D's, judging from the results 
with memoization. The large integer packages of both languages are ridiculously slow, in the case of D 
comparable to the scripting language Ruby. If you have an application that requires large integers, you 
will be better off trying to link to the GMP library (both languages offer the possibility of linking with C 
code). As for Go, a parallel implementation seemed to show no speedup and was running all on a single 
core, for reasons I do not understand. 

Python was a pleasant surprise. It beats both Perl and Ruby for 64-bit and large integers, respectively. 
That makes it a better choice than those other two for quick computations. 

Download source code in these programming languages 
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• The gnu compiler collection (gcc) homepage 

• The Intel C compiler (icc) homepage 

• Perl can be obtained from the Perl homepage, supplementary modules from CPAN 

• The Gnu Java compiler and interpreter can be found on the Gnu Java home page 

• Sun's Java home page 

• The xyssl library which contains one of the arbitrary-precision integer toolkits used in this com- 
parison seems to be defunct; the relevant source files are included in the source code package of 
this report 

• Haskell resources can be found at the Haskell homepage, as can the Glasgow Haskell Compiler 
(GHC) 

• Mark J. Dominus: Higher-Order Perl. Morgan Kaufmann 2005. ISBN 1558607013. Describes func- 
tional programming with Perl, among others memoization. 

• What every programmer should know about memory, a nine-part series of highly recommended 
articles. Part 2 deals with CPU caches. 

• The formulas for this web page were generated using gladTeX, which I can hotly recommend 
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